This lab focuses on creating interactive graphics using plotly, an open-source graphing tool that can interface with R and ggplot.

Directions (Please read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand each concept before you move on.
  2. Please record your answers and any related code for all embedded lab questions. I encourage you to try out the embedded examples, but you shouldn’t turn them in.
  3. Please ask for help, clarification, or even just a check-in if anything seems unclear.

Preamble

Packages and Datasets

This lab will primarily use the plotly package, but will also require the ggplot2 package.

# load the following packages
# install.packages("plotly")
library(plotly)
library(ggplot2)

The lab’s examples will use the college scorecard data that we’ve previously been working with:

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

Converting existing graphs

Before learning anything about plotly, you should be aware that it is possible to convert a ggplot object into a plotly graphic:

## Store a simple ggplot scatter plot
my_ggplot <- ggplot(data=colleges, aes(x=Cost, y=Salary10yr_median, color = Private)) + geom_point()
ggplotly(my_ggplot) ## Convert

The plotly version of this graph includes the following features:

  1. A dashboard with options to zoom in, zoom out, re-scale, etc.
  2. A tool-tip that displays information when you hover over a data point.
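
As a minimal sketch of how the tool-tip can be adjusted during conversion, the example below uses a small made-up data frame (`toy` and its columns are hypothetical, not part of the colleges data) and the `tooltip` argument of ggplotly() to restrict the hover label to the x and y values:

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data, used only for illustration
toy <- data.frame(cost   = c(20, 35, 50, 65),
                  salary = c(40, 48, 61, 70),
                  sector = c("Public", "Private", "Public", "Private"))

# An ordinary ggplot scatter plot
p <- ggplot(toy, aes(x = cost, y = salary, color = sector)) + geom_point()

# Convert to plotly, showing only the x and y aesthetics on hover
p_interactive <- ggplotly(p, tooltip = c("x", "y"))
p_interactive
```

Leaving `tooltip` at its default displays every mapped aesthetic, which is what the conversion above does.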

Creating the graph using plot_ly()

The code below demonstrates how to use plotly to create a scatter plot that is colored by a categorical variable:

plot_ly(data = colleges, type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private)
  • type = "scatter" tells plotly to draw a scatter plot
  • mode = "markers" plots the data as hover-able dots (rather than text labels or other symbols)

You should notice plotly uses a ~ character to identify variables from the data provided in the data argument. If it were omitted, plotly would look for a vector called “Cost” in your global R environment.
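
The two ways of supplying variables can be sketched side by side. The data frame `toy` below is a made-up stand-in for the colleges data, so treat the column names as assumptions:

```r
library(plotly)

# Hypothetical toy data, used only for illustration
toy <- data.frame(cost = c(20, 35, 50), salary = c(40, 48, 61))

# With ~, cost and salary are looked up as columns of the data argument
p_formula <- plot_ly(data = toy, type = "scatter", mode = "markers",
                     x = ~cost, y = ~salary)

# Without ~, the values must already exist as vectors, so they are
# extracted with $ and no data argument is needed
p_vectors <- plot_ly(type = "scatter", mode = "markers",
                     x = toy$cost, y = toy$salary)
```

The formula version is usually preferable because the variable names stay tied to the data frame they came from.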

Additionally, you should notice that this code does not perfectly recreate the graph we made using ggplot and ggplotly.

Comparison of plotly and ggplot

The decision to use plotly or ggplot depends upon the end goal of your visualization, but here are some factors to consider:

ggplot                                  plotly
Easier to construct complex graphics    More interactive
Easier customization (colors, etc.)     Allows for 3-D graphics
More legible syntax and grammar         Allows for animations
Annotations and exporting               Can convert ggplot graphics

Lab

Layering in plotly

Similar to ggplot, it is possible to build up a plotly graphic by adding layers via the %>% operator (akin to the + used with ggplot):

plot_ly(data = colleges) %>% 
    add_trace(type = "scatter", mode = "markers",   # set the mode explicitly to avoid plotly's "No scatter mode specified" message
              x = ~Cost, y = ~Salary10yr_median, color = ~Private) %>%
    add_text(x = ~Cost, y = ~Salary10yr_median, text = ~State)

The example above creates a scatter plot using add_trace(), then it draws a layer of text labels on top of those markers.

The pipe operator allows plotly to be compatible with data wrangling functions from the dplyr and tidyr packages:

library(dplyr)   # provides filter()

colleges %>% 
   filter(State %in% c("IA", "MN", "IL", "WI")) %>%   # keep four Midwestern states
   plot_ly() %>%
   add_trace(type = "box", x = ~Cost, y = ~State)

Typically, the first layer of a plotly graphic is created using add_trace() and the type argument. Additional layers are created using other add_ functions (such as add_text()). This prevents the less important layers from interfering with the hover capacity of the tool-tip.

You can use this reference page for a list of different types of traces (use the navigation drop down menus on the left side of the page).

Question #1: Using add_trace(), create a violin plot that separately displays the distributions of the variable “Enrollment” for private and public colleges in the “colleges” data set. Hint: Use the reference page linked above to determine the proper arguments needed to create this type of graph.

Custom Labels

Perhaps the most appealing feature of plotly is the ability to view a label whenever you hover over a data point or area of interest.

Information can be added to these labels using the text argument in either plot_ly() or add_trace(). For example, we can add the names of each college to our previous scatter plot:

plot_ly(data = colleges) %>%
  add_trace(type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)

Labels are constructed using hypertext markup language (HTML), so their appearance can be modified using HTML commands:

plot_ly(data = colleges) %>% 
    add_trace(type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private,
        text = ~paste0(Name, "<br>", City, ", ", State ))

In this example, paste0() is used to combine fixed character strings with variable values, and the string “<br>” is the HTML command used to begin a new line.
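
To see what paste0() hands to the text argument, it can help to build the labels on their own first. This sketch uses a made-up two-row data frame (`toy` is hypothetical) with the same column names as the example above:

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(Name  = c("Alpha College", "Beta University"),
                  City  = c("Ames", "Duluth"),
                  State = c("IA", "MN"))

# paste0() is vectorized: it builds one label per row,
# gluing the fixed strings between the column values
labels <- paste0(toy$Name, "<br>", toy$City, ", ", toy$State)
labels
# "Alpha College<br>Ames, IA"  "Beta University<br>Duluth, MN"
```

Each element is an ordinary character string; plotly interprets the `<br>` only when it renders the hover label.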

Some other useful HTML tags include:

  • <b> my text </b> - Bolds the text in between the tags
  • <i> my text </i> - Italicizes the text in between the tags
  • x<sub>i</sub> - Adds a subscript, in this case we get \(x_i\)

Question #2: Using the “colleges” data, create a scatter plot of the variables “FourYearComp_Males” and “FourYearComp_Females” that includes a custom label which shows each college’s name in bold text, and also shows on a new line its “PercentFemale” after the character string “percentage female:”.

3-D Graphics

The plotly package is able to create graphics in 3-dimensions. The code below creates a 3-D scatter plot:

plot_ly(data = colleges, type = "scatter3d", mode = "markers",
        x = ~Enrollment, y = ~Cost, z = ~ACT_median)

Because 3-D plotly graphs can be rotated, they tend to be more effective than 3-D scatter plots generated using other packages.

A second useful type of 3-D graph that plotly can create is a surface, which is most often used to display a fitted regression plane.

As an example, consider a linear regression model that predicts the median 10 year salary of graduates based upon a college’s cost and its admissions rate:

model <- lm(Salary10yr_median ~ Cost + Adm_Rate, data = colleges)

Creating a surface to visualize this model involves two preliminary steps:

  1. Creating a grid containing the combinations of predictors we’d like to appear in the graph (i.e., setting up the surface’s “x” and “y” scales)
  2. Creating a matrix of model predictions corresponding to each value within that grid (i.e., setting up the “z” scale, or the surface’s height)
## Step 1
xs <- seq(0, max(colleges$Cost, na.rm = TRUE), length.out = 100) # Seq from 0 to max cost
ys <- seq(0, 1, length.out = 100)                                # Equal length seq for adm rate
grid <- expand.grid(Cost = xs, Adm_Rate = ys)                    # Grid of every combo

## Step 2
z <- predict(model, newdata = grid)                         # Predictions across the grid
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE)        # Store predictions as a matrix

## Graph
plot_ly() %>%
  add_trace(type = "scatter3d", mode = "markers",   # set the mode explicitly to avoid plotly's "No scatter3d mode specified" message
            x = ~colleges$Cost, y = ~colleges$Adm_Rate, 
            z = ~colleges$Salary10yr_median, color = I("black")) %>% 
  add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Blues")

This code might seem complicated, but it’s easily adapted to other models and variables simply by modifying xs, ys, and model.
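
When adapting the template, the orientation of the prediction matrix is the easiest detail to get wrong: a plotly surface expects the rows of z to correspond to the y values and the columns to the x values. The base R sketch below uses a tiny made-up grid, with a simple sum standing in for predict(), to show why byrow = TRUE produces that layout:

```r
xs <- c(1, 2, 3)    # toy "x" values (3 of them)
ys <- c(10, 20)     # toy "y" values (2 of them)

# expand.grid() varies its first argument fastest:
# (1,10), (2,10), (3,10), (1,20), (2,20), (3,20)
grid <- expand.grid(x = xs, y = ys)

# Stand-in for predict(): any function of the grid columns works here
z <- grid$x + grid$y

# Filling by row puts one y value in each row and one x value in each column
m <- matrix(z, nrow = length(ys), ncol = length(xs), byrow = TRUE)
m
#      [,1] [,2] [,3]
# [1,]   11   12   13
# [2,]   21   22   23
```

In the lab's code both sequences have length 100, so nrow and ncol are interchangeable there. If xs and ys ever differ in length, nrow should be the length of ys and ncol the length of xs.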

Shown below is the regression surface of a generalized additive model (GAM), a type of model that uses spline functions to allow for non-linear relationships between the predictors and the outcome:

library(mgcv)
model <- gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)

Once the new model has been fit, only the matrix of predicted values needs to be updated (since the “x” and “y” variables from the previous example remain unchanged).

z <- predict(model, newdata = grid)                   # Predictions for every combination in the grid
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE)  # Store predictions as a matrix

## Graph
plot_ly() %>%
  add_trace(type = "scatter3d", mode = "markers",   # explicit mode, as in the linear model example
            x = ~colleges$Cost, y = ~colleges$Adm_Rate, 
            z = ~colleges$Salary10yr_median, color = I("black")) %>% 
  add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Reds")

Later in the semester we will discuss methods for determining which of these two models should be preferred.

Question #3: Using this section’s code as a template, add the linear regression surface for the model Debt_median ~ Net_Tuition + ACT_median to a 3-D scatter plot that uses “Net_Tuition” as the x-variable and “ACT_median” as the y-variable. You should use the lm() function to fit this model prior to Step 2.

Customizing Labels

Axis labels in plotly can be modified using the layout() function, while most other scales can be labeled in the function used to create them:

## Plot of the GAM model - gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)

## Graph
plot_ly() %>%
  add_trace(type = "scatter3d", mode = "markers",   # explicit mode avoids plotly's "No scatter3d mode specified" message
            x = ~colleges$Cost, y = ~colleges$Adm_Rate, 
            z = ~colleges$Salary10yr_median, color = I("black")) %>% 
  add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Reds", colorbar = list(title = "Salary")) %>%
  layout(scene = list(xaxis = list(title = "Cost"),
                      yaxis = list(title = "Admission Rate"),
                      zaxis = list(title = "Median 10 year salary")))

Documentation for the full set of options in layout() can be found here.
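
For a 2-D graph the axes are not nested inside scene; they are passed to layout() directly. A minimal sketch, using a made-up data frame (`toy` and its units are assumptions for illustration):

```r
library(plotly)

# Hypothetical toy data, used only for illustration
toy <- data.frame(cost = c(20, 35, 50), salary = c(40, 48, 61))

p <- plot_ly(data = toy, type = "scatter", mode = "markers",
             x = ~cost, y = ~salary) %>%
  layout(title = "Toy example",
         xaxis = list(title = "Cost (thousands of dollars)"),
         yaxis = list(title = "Salary (thousands of dollars)"))
p
```

The scene list used above is only needed for 3-D traces such as scatter3d and surface.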

Animation

Most plotly graphics can be made into animations by adding a frame argument, which defines a series of data snapshots that the animation will progress through.

As an example, the code below creates an animated bar chart showing the populations of US states for each year going from 2010 to 2018:

## Load the data
states <- read.csv("https://remiller1450.github.io/data/state_pops.csv")

## Tidy the data
library(tidyr)
library(stringr)
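states_long <- pivot_longer(states, cols = 2:ncol(states), names_to = "Year", values_to = "Population")
states_long$Year <- str_replace(string = states_long$Year, pattern = "X", replacement = "")           # strip the "X" that read.csv() prepends to column names starting with a digit
states_long$State <- str_replace(string = states_long$State, pattern = fixed("."), replacement = "")  # fixed() matches a literal period; a bare "." is a regex matching any character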

## Animation
plot_ly(data = states_long, type = "bar",
        x = ~reorder(State, X = Population, FUN = min), y = ~Population, frame = ~Year, showlegend = FALSE)

Notice that these data needed to be converted to “long format” for the column “Year” to be used as the frame argument. Additionally, reorder() is used to arrange the states by their initial population (assumed to be their minimum).
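
One detail of the tidying step deserves care: in a regular expression an unescaped period matches any character, so removing a literal period requires a literal (fixed) match. The sketch below shows the difference using base R's sub() on made-up names; stringr's str_replace() also treats its pattern as a regular expression by default, with fixed() playing the role of `fixed = TRUE`:

```r
# Hypothetical state names, two with a leading period and one without
state_names <- c(".Iowa", ".Ohio", "Texas")

# Regex match: "." matches ANY character, so the first character is always removed
regex_result <- sub(".", "", state_names)
regex_result
# "Iowa" "Ohio" "exas"

# Literal match: only an actual period is removed
literal_result <- sub(".", "", state_names, fixed = TRUE)
literal_result
# "Iowa" "Ohio" "Texas"
```

The two approaches agree only when every value starts with a period, so the literal match is the safer choice.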

Animations can be customized using the animation_opts() function.

## Fast and bouncy animation
plot_ly(data = states_long, type = "bar",
        x = ~reorder(State, X = Population, FUN = min), y = ~Population, frame = ~Year, showlegend = FALSE) %>%
  animation_opts(frame = 100, easing = "elastic", redraw = FALSE)

Within animation_opts(), the frame argument sets the amount of time between frames, in milliseconds. The default is 500, so this animation runs 5 times faster than the initial example.

The easing argument implements a transition between frames (in this case an elastic bounce). Different easing options are listed here between lines 68 and 103.

Finally, redraw = FALSE is used to avoid redrawing the entire plot at each frame. In this example it doesn’t make much of a difference, but for larger data sets it can greatly reduce lag.

Question #4: The code below reads a data set compiled by Mother Jones that aims to document all mass shootings in the United States. For this question, create an animated plot that displays the yearly number of fatalities and injuries in these shootings over time. For reference, a sample animation is included below (yours should be similar, but it doesn’t need to be identical). Hint: Before creating the animation you should use group_by(), summarize(), and pivot_longer() to prepare the data.

shootings <- read.csv('https://remiller1450.github.io/data/MassShootings.csv')